feat(page-cluster): add frequency-based template/content token split by YusukeHirao · Pull Request #901 · d-zero-dev/tools

YusukeHirao · 2026-07-03T09:10:17Z

Summary

Add computeDocumentFrequency() and splitTokensByFrequency() to @d-zero/page-cluster: a preprocessing layer that separates a page's shared site chrome (header/nav/footer) from page-specific content by document frequency, before either half is compared with jaccardSimilarity() (PR feat(page-cluster): add Jaccard similarity and array edit distance primitives #900, still open).
A single flat Jaccard over a page's full token set has two failure modes: common chrome dilutes genuine content differences at loose similarity thresholds, and page-specific content variation (e.g. a freeform CMS block-editor page) swamps a real layout match. Splitting first and comparing the template and content axes separately fixes both. The split uses no tag-name or class-name semantics, only document-frequency statistics.
Validated against two real crawls (a small single-layout corporate site, and a much larger site that turned out to be a federation of independent sub-sections with no single dominant layout): the split works cleanly on a homogeneous corpus, and requires scoping to one section first on a heterogeneous multi-section site.
Scope intentionally excludes the classifier core (MinHash/LSH, medoid clustering) and auto-discovery of homogeneous page groups for large multi-section sites — both still need separate design decisions.
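
As an illustration of the idea in the Summary, here is a minimal sketch of a document-frequency split followed by a per-axis Jaccard comparison. The names `countDocumentFrequency`, `splitByFrequency`, and `jaccard` are hypothetical stand-ins: this PR text does not show the real signatures of `computeDocumentFrequency()`, `splitTokensByFrequency()`, or `jaccardSimilarity()`, and the token strings and the 0.8 threshold are made up for the example.

```typescript
// Hypothetical stand-ins, NOT the @d-zero/page-cluster API.
type Page = ReadonlySet<string>;

/** Count, for each token, how many pages contain it at least once. */
function countDocumentFrequency(pages: readonly Page[]): Map<string, number> {
	const df = new Map<string, number>();
	for (const page of pages) {
		for (const token of page) {
			df.set(token, (df.get(token) ?? 0) + 1);
		}
	}
	return df;
}

/** Tokens found on at least `threshold` (a fraction) of pages are treated as template. */
function splitByFrequency(
	page: Page,
	df: ReadonlyMap<string, number>,
	pageCount: number,
	threshold: number,
): { template: Set<string>; content: Set<string> } {
	const template = new Set<string>();
	const content = new Set<string>();
	for (const token of page) {
		const ratio = (df.get(token) ?? 0) / pageCount;
		(ratio >= threshold ? template : content).add(token);
	}
	return { template, content };
}

/** Plain set Jaccard; two empty sets are treated as identical (an assumed convention). */
function jaccard(a: ReadonlySet<string>, b: ReadonlySet<string>): number {
	if (a.size === 0 && b.size === 0) return 1;
	let intersection = 0;
	for (const x of a) if (b.has(x)) intersection++;
	return intersection / (a.size + b.size - intersection);
}

// Made-up corpus: three pages sharing the same chrome.
const pageA: Page = new Set(['header', 'nav', 'footer', 'hero', 'news-list']);
const pageB: Page = new Set(['header', 'nav', 'footer', 'article', 'sidebar']);
const pageC: Page = new Set(['header', 'nav', 'footer', 'article', 'gallery']);
const corpus = [pageA, pageB, pageC];

const df = countDocumentFrequency(corpus);
const splitB = splitByFrequency(pageB, df, corpus.length, 0.8);
const splitC = splitByFrequency(pageC, df, corpus.length, 0.8);

const flatSimilarity = jaccard(pageB, pageC);
const templateSimilarity = jaccard(splitB.template, splitC.template);
const contentSimilarity = jaccard(splitB.content, splitC.content);
```

In this toy corpus the flat score for pages B and C is 4/6, while the split reports a template similarity of 1 and a content similarity of 1/3, so the shared chrome no longer inflates the content comparison.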

Test plan

yarn build (28 projects)
yarn lint
yarn test (1130 tests)
/code-review xhigh — 9 findings (all in the frequency-cutoff comparison: unvalidated threshold, floating-point boundary rounding, pageCount/documentFrequency desync risk), all fixed
/qa-engineer — no additional findings

Note

PR #900 (jaccardSimilarity/arrayEditDistance) is still open/unmerged; this branch has no compile-time dependency on it, but the two PRs are conceptually part of the same classifier-core preprocessing layer.

🤖 Generated with Claude Code

Add computeDocumentFrequency() and splitTokensByFrequency(), a preprocessing layer that separates a page's shared site chrome (header/nav/footer) from its page-specific content by document frequency, before either half is compared with jaccardSimilarity().

A single flat Jaccard over a page's full token set has two failure modes: common chrome dilutes genuine content differences at loose similarity thresholds, and page-specific content variation (e.g. a freeform CMS block-editor page, where the exact block mix differs per page) swamps a real layout match. Splitting first and comparing each axis separately fixes both.

Validated against two real crawls: a small single-layout corporate site (a few hundred pages) showed a clean bimodal frequency split stable across a wide threshold range; a much larger site that turned out to be a federation of independent sub-sections (no single section covering even half the pages) showed the split requires a homogeneous input, and recovers cleanly once scoped to one section.

code-review (xhigh) surfaced 9 findings, all in the frequency-cutoff comparison: unvalidated threshold allowing degenerate cutoffs (0, NaN, or a percentage instead of a fraction), floating-point rounding at the documented inclusive boundary, and pageCount being passable out of sync with the documentFrequency it was computed from. Fixed by validating threshold eagerly, applying an epsilon tolerance to the boundary comparison, and bundling pageCount with documentFrequency into one DocumentFrequency result so they cannot be passed independently.
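
The three review fixes can be sketched roughly as below. This is an assumption-laden illustration, not the merged code: the `DocumentFrequencySketch` field layout, the helper names, the accepted range (0, 1], and the `EPSILON` value are all hypothetical, since the PR text names the `DocumentFrequency` result but does not show its shape or the tolerance it uses.

```typescript
// Hypothetical shape: pageCount travels with the counts it was computed from,
// so a caller cannot pass a pageCount that belongs to a different corpus.
interface DocumentFrequencySketch {
	readonly pageCount: number;
	readonly documentFrequency: ReadonlyMap<string, number>;
}

// Assumed tolerance; the value used in the PR is not stated.
const EPSILON = 1e-9;

/** Reject degenerate cutoffs up front: NaN, 0, negatives, or a percentage such as 80. */
function assertValidThreshold(threshold: number): void {
	if (!Number.isFinite(threshold) || threshold <= 0 || threshold > 1) {
		throw new RangeError(`threshold must be a fraction in (0, 1], got ${threshold}`);
	}
}

/** Inclusive cutoff: a token whose page ratio equals the threshold counts as template. */
function isTemplateToken(
	token: string,
	df: DocumentFrequencySketch,
	threshold: number,
): boolean {
	assertValidThreshold(threshold);
	const count = df.documentFrequency.get(token) ?? 0;
	// The tolerance keeps a ratio that sits on the boundary from being
	// excluded by floating-point rounding in either operand.
	return count / df.pageCount + EPSILON >= threshold;
}

// Made-up data: "nav" appears on 3 of 10 pages.
const sketch: DocumentFrequencySketch = {
	pageCount: 10,
	documentFrequency: new Map([['nav', 3]]),
};

// 0.1 + 0.2 evaluates to 0.30000000000000004, slightly above 3 / 10.
const computedThreshold = 0.1 + 0.2;
const strictResult = 3 / 10 >= computedThreshold;
const tolerantResult = isTemplateToken('nav', sketch, computedThreshold);
```

Here `strictResult` is `false` while `tolerantResult` is `true`, which is the kind of boundary disagreement an epsilon tolerance is meant to absorb when the threshold is itself the result of arithmetic.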

YusukeHirao requested a review from yusasa16 as a code owner July 3, 2026 09:10

YusukeHirao force-pushed the feat/page-cluster-frequency-split branch from c9323a9 to 68acfe8 Compare July 3, 2026 09:18

Merge remote-tracking branch 'origin/dev' into feat/page-cluster-freq…

57a0057

…uency-split # Conflicts: # cspell.json

YusukeHirao merged commit b233683 into dev Jul 3, 2026
6 checks passed

YusukeHirao deleted the feat/page-cluster-frequency-split branch July 3, 2026 09:44

This was referenced Jul 3, 2026

feat(page-cluster): add URL-path and stylesheet blocking-key derivation #902

Merged

feat(page-cluster): resolve which blocking key to use per page #903

Merged

YusukeHirao merged 2 commits into dev from feat/page-cluster-frequency-split
